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Abstract 

We study the convergence of the Regularized Alternating Least-Squares 
algorithm for tensor decompositions. As a main result, we have shown 
that given the existence of critical points of the Alternating Least-Squares 
method, the limit points of the converging subsequences of the RALS are 
the critical points of the least squares cost functional. Some numerical 
examples indicate a faster convergence rate for the RALS in comparison 
to the usual alternating least squares method. 



1 Introduction 

A well-known iterative method for C ANDECOM /PARAFAC (CP) is the Alter- 
nating Least-Squares (ALS) technique. Independently, the ALS was introduced 
by Carol and Chang [5] and Harshman [115] in 1970. It has been extensively 
applied to many problems across various engineering [M] [35] pQ |T3] and science 
[36] [24] fields; see the survey papers [23] [11] and the references therein. For 
example, Beylkin and Mohlenkamp [5j [6] utilizes ALS to compute optimal sep- 
aration rank for certain operators like inverse Laplacian and the multiparticle 
Schrodinger equation to reduce computational complexity. In a more recent ap- 
plication, Doostan et al. [H] has implemented ALS to study complex systems 
modeled by stochastic PDEs. 

Its widespread success can be attributed to the simplicity of the method. 
Moreover, Bro et al. [35] [37J found that the ALS method gives superior quality 
solutions with fewer memory and time requirements than the other CP methods. 
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Despite its success, the ALS has some drawbacks. For example, initialization of 
the factor matrices, collinearity in the factor matrices or degeneracy problems 
may require a high number of iterations for the ALS method to converge. This 
slowed convergence characterized by a flat curve in a log error plot is referred 
to as the swamp. Swamps can be present in the non-degenerate and degenerate 
cases. The degenerate case is a more challenging problem; see [315] [H] for some 
regularization techniques for the degenerate swamps. 

Here we address the non-degenerate case. There have been several meth- 
ods which address the issues of the swamp occurrences in the non-degenerate 
case. For example, line search schemes [33] [25] have been used to accelerate 
the ALS algorithm. An entirely different approach by De Lathauwer, De Moor 
and Vandewalle obviates the swamp issues by considering a simultaneous ma- 
trix diagonalization for CP decomposition [10] [9j. Paatero [31] have applied 
regularization to a gradient descent based method for CP. 

In this paper, we analyze the Regularized Alternating Least-Squares (RALS) 
method introduced by Navasca, Kindermann and De Lathawer |28j . The imple- 
mentation of RALS is simple; it is no more complicated than the ALS algorithm. 
The cost functional of the RALS penalizes the difference between the current 
and previous factor iterates with a regularization parameter. Unlike the tensor 
regularization method found in 30J [27 , RALS is an unconstrained optimiza- 
tion problem since there is no uniform constraint in the penalty terms that are 
sequentially changing at each iteration, and thus, the sequences of limit points 
of RALS can be unbounded. Hence, RALS does not address the degeneracy 
problem; i.e. RALS will not find a critical point if the original ALS functional 
does not have a critical point. 

The study of the convergence analysis of ALS and RALS is facilitated by 
an optimization framework; ALS is the nonlinear block Gauss-Seidel (GS) and 
RALS is the nonlinear block proximal point modification of GS (PGS) for CP 
tensor decomposition. What we have shown is that if a limit point exists, then 
it is a critical point of the functional. More specifically, we study the ALS 
non-degenerate swamps by analyzing how RALS removes, if not, shortens the 
swamps. Furthermore, our convergence analysis brings attention to fact that 
the RALS functional has a weakened assumption than the requirement of the 
ALS functional. This finding sheds light on the swamps in the non-degenerate 
case in the ALS method. 

1.1 Organization 

Beginning with Section [2] we give some preliminaries which include basic defini- 
tions of rank-one tensor and CP decomposition. Section [3] reviews the classical 
ALS for third-order tensors and includes a discussion on the ALS swamp through 
an example. Section[4]is the main section where we introduce the RALS method 
and show that if the sequence obtained from RALS algorithm converges, then 
the limit points are the critical points of the original ALS algorithm. In Section 
5, we provide a numerical comparison study of ALS and RALS with several 
data sets. Lastly, we make some concluding remarks in Section 6. 
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2 Preliminaries 



We denote the scalars in K with lower-case letters (a,b, . . .) and the vectors with 
bold lower-case letters (a, b, . . .). The matrices are written as bold upper-case 
letters (A, B, . . .) and the symbol for tensors are calligraphic letters (A, B, . . .). 
The subscripts represent the following scalars: {A) i j k = a^fe, (A)y = a^-, 
(a)j = di and the r-th column of a matrix A is a r . The matrix sequence is 
{A*}- 

Definition 2.1 The Kronecker product of matrices A and B is defined as 



B = 



cinB CI12B 

fl2lB &22B 



Definition 2.2 TTie Khatri-Rao product is the "matching columnwise" Kro- 
necker product. Given matrices A e M. IxK and B e R JxK , their Khatri-Rao 
product is denoted by A©B. The result is a matrix of size (IJ x K) defined by 

A©B = [Ai®Bi A 2 ®B 2 ...]. 

Definition 2.3 (Mode-n fibers) A mode-n fiber of an Nth order tensor is a 
vector defined by fixing all indices but the n-th one. 

For example, a matrix column is a mode-1 fiber and a matrix row is a modc- 
2 fiber. Third-order tensors have column (mode-1), row (mode-2) and tube 
(mode-3) fibers, denoted by x : jk, Xi : k and xy : respectively. 

Definition 2.4 (Mode-n matricization) Matricization is the process of re- 
ordering the elements of an Nth order tensor into a matrix. The mode-n ma- 
tricization of a tensor T € ^ix^x-x^w { s denoted by TV n ) and concatenates 
the mode-n fibers to be the columns of the resulting matrix. 

If we use a map to express such matricization process for any TVth order 
tensor T € jj-rix^x-x/^ ^at is, the tensor element (ii, i 2 , ■ . ■ , In) maps to 
matrix element (i n ,j), then there is a formula to calculate j: 

N fc-1 

j = 1 + ^2(ik - l)Jk with J k =Y[l m - 

fc=l m=l 

So, given a third-order tensor X e R IxJxK 7 the mode-1, mode-2 and mode-3 
matricizations of X are: 

X(i) = [x :11 , . . . ,X.J 1 ,X :12 . . . ,x :J2 , . . . ,x. 1K , • • ■ ,X.jk], 
X(2) = [Xl:l, . . . , Xl : l, X i:2 • • • , Xl :2 , • • • , Xl :K , • • • , Xl :K ], 
X( 3 ) = [Xn : , . . . ,Xn : ,Xi2: . . . ,Xi 2: , . . . ,Xi J: , . . . ,Xij ; ]. 
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Definition 2.5 (Rank-one tensor) An Nth order tensor T G M. hxl2X '" xlN 
is a rank-one tensor if it can be written as the outer product of N vectors, i.e., 

T = a( 1 W 2 )o...oaW 

where sS^ G M /rXl ,l < r < N. The symbol " o" represents the vector outer 
product. This means that each element of the tensor is the product of the cor- 
responding vector elements: 

thi 2 -i N = 41 a it ■ ■ ■ a i^\ f or al1 1 < < In- 

3 ALS and Nonlinear Block Gauss-Seidel Method 

In 1927, Hitchcock [20] [H] proposed the idea of the polyadic form of a ten- 
sor, i.e., expressing a tensor as the sum of a finite number of rank-one tensors. 
Currently, this decomposition is called the C ANDECOMP /PARAFAC (CP) de- 
composition. The Parallel Factor Decomposition (PARAFAC) first appeared in 
[TU] in the context of psychometrics. Independently, [5] introduced this decom- 
position as the Canonical Decomposition (CANDECOMP) in phonetics. The 
work of Kruskal in 1977 [25] [26] provided a sufficient condition, 

1 + J + K >2R + 2, 

where /, J and K denote the k-rank (defined as the maximum value k such that 
any k columns are linearly independent in a matrix) of matrices A, B and C 
respectively, for uniqueness up to permutation and scalings of CP. Later on, De 
Lathauwer [TO] and Jiang and Sidiropoulous [22] gave new sufficient conditions 
for uniqueness by assuming only one full-rank factor with the new bound, 

R(R — 1) 1(1 - 1) J(J - 1) 

2 ~ I ' 

Also, some constraints on the factor matrices of the CP are considered by re- 
quiring all the columns in each factor matrix to be orthonormal. This condition 
is useful in applications like independent component analysis (ICA) [7]. 

In terms of numerical methods for computing CP decomposition, there are 
several methods (see e.g. [37]) to solve CP decomposition of a given tensor. The 
ALS method is the most popular technique. We will discuss the connection of 
ALS to the nonlinear block Gauss-Seidel (GS) method [3] [UJ. This connection 
is important since it is a well-known fact that the GS method does not necessar- 
ily converge, leading us to further discuss some convergence results of the ALS 
algorithm. 

3.1 Alternating Least Squares 

For the simplicity of the exposition, we looked at third-order tensors, but all 
the analysis holds for higher-order tensors. For a given third-order tensor T G 
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alxJxK 



its CP decomposition is 
R 

T~^a r ob r oc r . (3.1) 

r=l 

The factor matrices are the combination of the vectors from the rank-one 
components; i.e., A = [ai,a 2 ,-- , a R ] € R IxR , B = [bi,b 2 ,-- ,b R ] e R JxR 
and C = [ci,C2, • • ■ , cr] € M^ xfl where R is called the rank of the tensor T 
denoted by R = rank(T). 

The problem we want to solve is the following: given a third-order tensor 
T £ M. IxJxK , compute its CP decomposition with R components of rank-one 
tensors that best approximates T, i.e., 



\T — T\\ 2 p, where T = a r o b r o c r , (3-2) 

' r=l 

and || • ||p is the Frobenius norm. The problem is equivalent to 



minimize 

f 



R 

jnin || T — ^""^ a r o b r o c r ||^ (3-3) 



A,B.C 



with respect to factor matrices A, B and C. 

By using the Khatri-Rao product and tensor matricization, (3.1) can be 
written in three matricized forms: 



T 



T(i)«A(C0B) 
T (2) «B(C0A) T 
T (3) « C(B A) 



T 



Then by fixing all factor matrices but one, the problem reduces to three 
coupled linear least-squares subproblems. Thus, ALS solves three least-squares 
subproblems to obtain the factor matrices through these subproblems: 

A fe+1 = argmin||T (1) /XJK -A(C fe 0B fe ) T ||2,, 

B fe+1 = argmin||T (2) JX/ ^-B(C fc 0A fc+1 ) T |||,, (3.4) 

BeR Jxi? 

C fc+1 - a rgmin||T (3) Xx "-C(B fc + 1 0A fc + 1 ) T ||| ) 

where T {1) IxJK , T (2) JxIK and T (3) KxIJ are the mode-1, mode-2 and mode-3 
matricizations of tensor T. So starting from an initial guess A , B°, C°, the 
ALS approach fixes B and C to solve for A, then fixes A and C to solve for 
B, and then fixes A and B to solve for C. This process continues iteratively 
until some convergence criterion is satisfied. Therefore, this method translates 
the original nonlinear minimization problem to three subproblems where each 
one is just a least-squares problem. 
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3.2 Block Nonlinear Gauss-Seidel Method 

In this section, we want to introduce the nonlinear block Gauss-Seidel method 
[5] [3] [3] [T7] [T8], a technique that is used to find a minimizcr of a nonlinear 
functional. We will see that ALS is a special case of GS. 
Consider such a problem: 

minimize /(x) (3-5) 
subject to xeI = I 1 xI 2 X'-' X m C R" , 

where / be a continuously differentiable function from R™ to R and X is the 
cartesian product of closed, nonempty and convex subsets Xi C R™ ; , for i = 

m 

1, . . . , m, with rii = n. If the vector x G R" is partitioned into m component 

i=l 

vectors Xj € R™ ; , then we can consider / is a function from R™ 1 x R™ 2 x ■ • • R" m 
to R with 

/( x ) =/(xi,x 2 ,-- - ,x m ). 
The minimization of the block nonlinear Gauss-Seidel method for the solution 
(3.5) is defined by the iteration, 

Xi^ 1 = argmin /( Xl fc+1 , . . . , Xi_i* +1 , y ; , x i+1 fc , . . . , x m fc ), 

yieXi 

which in turn updates the components of x, starting from a given initial guess 
x°el and generating a sequence {x fe } = {(x 1 fe ,x 2 fe , . . . ,x m fc )}. 

We introduce the following definitions to facilitate our discussion on the 
connection between GS and ALS. 

Definition 3.1 (Vectorization) The vectorization of a matrix 

M=[mi,m J| ." ,m n ] GR mxn , 

where nii is the i-th column of M, is denoted by uec(M) which is a vector of 
size (mn) defined by 

mi 



uec(M) 



m 2 



From the PARAFAC formulation (3.2 ) and the definition of rank-one tensor, 
the cost function we want to minimize is 

K J I R 

\\T - TTf = E EEfe' - E *irb jr c kr ? = /(A, B, C), 

fc=l j=l i=l r=l 

where the cost function is a function s.t. / : R™ — > R, n = (I + J + K)R. Let 
x = vec([vec(A), uec(B), uec(C)]) € R n , it is obvious that 

K J I R 

/(x) = /(A, B, C) = E E Efe* - E airbjrCkrf. 

k=l j=l i=l r=l 
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Let vec(A) = x x € R IR , vec(B) = x 2 £ M Jfl and wec(C) = x 3 € K A " fl so that 
we partition the vector x <E R™ into 3 component vectors Xi £ M. ni , i — 1,2,3. 
m = IR, ri2 — JR and 113 — KR. It follows that x = xi x x 2 x X3 £ 
E"i x x = Thus, the CP decomposition can be reformulated to 
the following problem: 

minimize /(x) (3-6) 
subject to x £ R ni x M" 2 x E" 3 = i". 

From the ALS algorithm, the updates are in terms of components of x, start- 
ing from a given initial point x° = vec([vec(A°), wec(B°), wec(C )]) £ R n and 
generates a sequence {(xi fc , x 2 fc , x 3 fc )} by the following: 

Xl k+1 = argmin/(yi,x 2 fe ,x 3 ' £ ), 
x 2 fe+1 = argmin/(xi fc+1 ,y 2 ,x 3 ,c ), 

y 2 eR'«2 

x 3 fc+1 =argmin/(x 1 fc+1 ,x 2 fe+1 ,y 3 ). 

y 3 6R" 3 



Notice that this is the exact GS method to solve the problem (3.6 1. Therefore, 
the ALS algorithm is the block nonlinear Gauss-Seidel method for solving the 
CP decomposition of a given tensor. 

3.3 Some Analysis about ALS 

Since we already know that the ALS method coincides with the GS method, we 
can bring some GS results to analyze the ALS algorithm. 

Definition 3.2 (Critical Point) Let f : X — > K, X C W 1 is a continuously 
differentiable junction, a critical point of f is a point x£l such that 

V/(x) T (y-x) >0, Vyel, (3.7) 

where V/(x) £ R™ denotes the gradient of f atx and V/(x) T is the transposition 



of it. If X = R" or ifx is an iterior point of X, then the condition (3.7) reduces 



to the stationarity condition V/(x) = of unconstrained optimization. 
Theorem 3.3 (Optimality Condition) (a) Ifx is a local minimum of f over 



X, then it satisfies the optimality condition (3.1), i.e., 
V/(x) T (x-x) > 0, Vxei. 

(b) If f is convex over X, then the condition of part (a) is also sufficient for 
x to minimize f over X . 

If X — W l or ifx is an interior point of X, then the condition reduces to 
V/(x) = 0. 
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Definition 3.4 (Limit Point) We say that a vector x £ M. n is a limit point 

of a sequence {x' c }^° =1 inl" if there exists a subsequence of {x k }^ =1 that con- 
verges to x. 

Definition 3.5 (Convex Function) A real-valued function f(x) defined on a 
convex subset is called convex if for any two points x± and Xi, in its domain 
and any t € [0, 1], 

f(tx x + (1 - t)x 2 ) < tf(xi) + (1 - t)f(x 2 ). 

If furthermore, 

f(t Xl + (1 - t)x 2 ) < tf( Xl ) + (1 - t)f(x 2 ), 

X\ 7^ x 2 , then f is strictly convex. 

Definition 3.6 (Quasiconvex Function) A function f : S — > K defined on 
a convex subset S of a real vector space is quasiconvex if whenever x, y G S 
and A <E [0,1], then 

f(Xx + (l-X)y)<max(f(x),f(y)). 

If furthermore, 

f(Xx + (l-X)y)<max(f(x),f(y)), 

x 7^ y, then f is strictly quasiconvex. 

Consider the function f in (3.5), which is defined on a subset X = X\ x 
X 2 x ■ ■ ■ x X m , we say that f is quasiconvex with respect to 6 1; on X if for 
every i£X and jji G X i7 we have 

f(xx, . ..,tXi + (1 - t)yi, . . .,x m ) < max{f(x),f(xi,. ..,yi,.. . ,x m )}, 
for all t 6 (0, 1). If furthermore, 

f(xi, . . . ,tXi + (1 - t)y u . . .,x m ) < max{f{x),f(xi,. . . ,y t , .. . ,x m )}, 

yi =/= Xi, then f is strictly quasiconvex. 

The convergence of the GS method is studied under different assumptions 
(see e.g. [3] [T7] [IE]). 



Theorem 3.7 (see |17| ) Suppose that the function f in (3.5) is strictly qua- 
siconvex with respect to Xj on X, for each i = l,...,m — 2 in the sense of 
Definition 3.6 and that the sequence {x fe } generated by the GS meth od has limit 
points. Then, every limit point x of {x fc } is a critical point of (3.5). 



Theorem 3.8 (see |3j) Let f be the function in (3.5). Suppose that for each 
i and x € X , the minimum of 



min /(xx, . . . , Xi_!, £, x i+ i , . . . , x m ) 



is uniquely attained. If x fc is the sequence generated by GS, then every limit 
point of x fc is a critical point. 
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(a) Quasiconvex function, but not convex 



(b) Not a quasiconvex function 



Figure 1 



These theorems show that the GS method can produce a converging sequence 
with limit points that are critical points of the problem. But, in general, the GS 
method may not converge, in the sense that it may produce a sequence with limit 
points that are not critical points of the problem. A counterexample of Powell 
[32] (see also |17j ) shows that for a non-convex function that is component-wise 
convex but not strictly quasiconvex with respect to each component, its limit 
points need not be critical points. 

Comparing these convergence results for the GS method with the least- 
squares cost functionals, we observe that neither of the hypothesis in [3] or [T7] 
are satisfied. Indeed, the least-squares cost functional is convex (even quadratic) 
in each component and therefore, quasiconvex. However, in the case that the 
Kathri-Rao product of two factor matrices involved is rank deficient, then the 
least-squares function will not be strictly quasiconvex (see the following propo- 
sition). 

Proposition 3.9 Let f{x) = ||Ax - b|| 2 where A £ R mx ", m > n, x G R nxl 
and b € K mxl . If A is rank deficient, then /(x) is not strictly convex. 

Proof. Since A is rank deficient, then assume rank(A) = r which implies 
that dim(NulA) = n — r. Take x, x <E NulA where x ^ x. Then, according 
to the definition of a strictly convex function, for any t 6 [0, 1], /(fx + (1 — 
t)x) = || A[tx + (1 - i)x] - b|| 2 = ||b|| 2 and f/(x) + (1 - t)f(x) = ||b|| 2 . Thus, 
/(tx+(l-t)x)=t/(x) + (l-t)/(x). □ 



It follows from the proposition above that / is not a strictly quasiconvex 



function since f(tx + (1 — t)x) = /(x) = /(x). Thus from Theorem 3.7 a limit 
point of the ALS sequence is not guaranteed to be a critical point. 

The main difficulty in proving the convergence is the lack of strict (quasi) convexity 
in the case of rank deficient Khatri-Rao products. This indicates that one reason 
for the occurrence of swamps, namely if the Khatri-Rao products of at least two 
of the three iteration matrices is almost singular, the associated least-squares 
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functional will be flat. Thus, we can expect slow convergence, verified by the 



plots in Figures [2a] and [2b] We observe that the region of a swamp in the ALS 
method (the plateau in the left convergence plot) is strongly correlated with 
very small singular value of the Khatri-Rao product of B' and C fe . 



1000 2000 3000 4000 5000 0000 7000 8000 9000 10000 




1000 2000 3000 4000 5000 0000 7000 8000 9000 10000 



(a) Fifth order tensor (b) Smallest singular values of C k B fc 

Figure 2 



4 The Regularized Alternating Least- Squares 

In the last section, we analyzed why ALS do not always converge through the 
properties of the GS method while examining RALS 28J , a proximal point mod- 
ification of the Gauss-Seidel method (PGS) (see [3] [IT]) for tensor decomposi- 
tion. The analysis provides some explanations on why RALS performs better 
than than ALS and decreases the high number of ALS iterations if there are 
swamp occurrences. 

4.1 Regularized ALS 



The regularized ALS solves the same problem (3.2). It recasts the main problem 
to three subproblems for a third-order tensor. But RALS has an extra term in 
each subproblem. Therefore, in order to solve the problem: 

R 



mm T— > a r ob r oc r p , (4.1) 

A.B.C 11 Z— < " i " 

r=l 



with respect to factor matrices A, B and C, for a given third-order tensor T, 
here are RALS subproblems: 

A fe+1 = argmin||T (1) /XJK -A(C fc 0B fc ) T |||i + A fc ||A fe - A|||, 
Am IxR 

B k+1 = argminllT^^^-B^OA^fllli + AfcllB^-BUli, (4.2) 

C fe+1 = argmin ||T ( 3) Kx/J -C(B fe + 1 0A fe+1 ) T ||^ + A fc ||C' £ -C||^ 

cm KxR 
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where Afc > is the regularization parameter. The regularization terms Afc|| A fe — 
A|||,, AA;||B fc — B|||, and Afc||C & - C\\ 2 F are the fitting terms for the factors A, 
B and C. 

In fact, RALS also gives us three least-squares subproblems. For example, 



the first subproblem in (4.2) actually is equivalent to solving a least-squares 
problem: 



(C fe 0B*) ^ 1 



X k -I RxR 



X k ■ (A fe ) T 



which is different from the least-squares obtained from ALS, that is 



(4.3) 



(C fe ©B fe )X = T (1) T . (4.4) 



RALS-Algorithm 


procedure CP-RALS(A", R, N, A) 


give initial guess A e R IxR , B° G R JxR , C° e R KxR , A 


for n = 1, . . . , N do 




[(C n 0B«); X n I RxR ] G R(JK+R)xR 


s <- 


[K w T ;X n (A^]eR( JK + R ^ XI 


A n+1 


W/S — — % solving least squares to update A 


W <r- 


[(C n A" +1 ); X n I RxR ] € RVK+^xR 


S <- 


[X (2) T ;A n (B") T ]el( /K +«) XJ 


B n+1 


«— W/S — - % solving least squares to update B 


W <- 


[(B" +1 A n+1 ); A n I fixR ] e R( IJ + R *> xR 


s <- 


[X (3) T ;A„(C«) T ]G]R( /J +^^ 


C n+1 


<— W/S — - % solving least squares to update C 


An+1 


<— 6 ■ A n — — % update regularization parameter 


end for 




return A 




end procedure 



The number of iteration TV is set to a large number; otherwise a convergence 
stopping criterion can be used. 

Our notion of a regularized ALS can be misleading. In the usual regular- 
ization setting, the minimizer (critical point) of the regularized cost functional 
is found. The additional regularization terms in |4.2| penalize the difference be- 
tween the previous iterates, which themselves need not be a bounded sequence. 
Although there is regularization in each step, the method imposes no uniform 
constraint for all k on the matrices A fc ,B fc , C k . For this reason, the cost func- 
tional |4.2| is not a constrained optimization problem, i.e. no global optimal 
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solutions are guaranteed. In particular, this approach does not address the de- 
generacy problem. Moreover, the limit points of the RALS algorithm will be 
shown as the critical points of the least-squares functional |4.1| and not of the 
regularized version. 

The RALS method differs from the Tikhonov regularization for CP decom- 
position considered in [37] [3D] in the following way: the Tikhonov functional 
minimized is 



R 

T — a r o b r o c, 

r=l 



A(||A|| F + ||B|| F + ||C|| F ). (4.5) 



If ALS is applied to this regularized functional, then the corresponding sub- 
problems are 



(4.6) 



Observe that the penalization terms, ||A|| F , ||B|| F and ||C|| F , are independent 
of k, which are viewed as uniform constraints on the norm of the matrices. From 



A fe+1 


= argmin 


T (l) 


JK 


- A(C fc ©B ,c ) T || F 


+ A||A|| 2 F , 




Am IxR 








B fe+1 


— argmin 


I 1 (2) 


IK 


-B(C fc 0A fc+1 ) T | 


l + A||Bf F , 




B£l JxH 








C fe+1 


= argmin 


HT(3)* 


XlJ 


-C(B fe+1 ©A fe+1 


) T |^ + A||C|| F 




CGR KXJi 









[27] , this constrained optimization problem 4.5 always has a globally optimal 
solution. However, the price to pay here is that the optimal solution is not a 
critical point of the |4.1[ but it is a critical point of the regularized functional. 



4.1.1 Proximal Point Modification of the Gauss-Seidel (PGS) Method 

In Section 3, we have shown that the ALS (GS) method may not converge 
due to a requirement of convexity or quasiconvexity assumption to guarantee 
convergence. Thus, a modification of GS is considered by adding an extra term 
in each iteration: 

Xi fc+1 = argmin /( Xl fc+1 , . . . ,yi, . . . ,x m fc ) + ^Hy; - x^] 2 . 

This method is called the proximal point modification of the GS (PGS) method 
(see [3J, [T7]). It is also referred as partial proximal minimization [4 . The PGS 
formulation lead to a weakened assumption for convergence to critical points. 

Definition 4.1 The GS and PGS methods are well-defined if every subproblem 
has solutions. 

Proposition 4.2 (Convergence proposition of PGS [1TJ ) Suppose that the 
PGS method is well defined and that the sequence {x fe } has limit points, then 



every limit point it o/{x } is a critical point of problem (3.5) 
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Recall that in Section 3.2 we showed that the ALS method is the GS method 
for CP decomposition with respect to the factor matrices A, B and C. Similarly, 
through vectorization of the three factor matrices, we have 

X!^ 1 = argmm{/(y 1 ,x 2 fc ,x 3 fe ) + A fc ||xi* -yi|||}, 

yi6R"i 

x 2 fc+1 = argmin{/(x 1 fe +\y 2 ,x 3 fc ) + A fc ||x 2 fe -y 2 || F }, 
y 2 GR"2 

x 3 fc+1 = argmin{/(x 1 fc + 1 ,x 2 fe + 1 ,y 3 ) + A fe ||x 3 fc - y 3 || 2 F }. 
Thus the regularized ALS is the PGS method for CP decomposition. 

4.2 Convergence Result of the Regularized ALS 

In this section, we will show that the converging sequence obtained from RALS 
method leads to a critical point. This characterization is not true for the ALS 
algorithm as we have seen in Section [3] where converging sequence of factor 
matrices {(A fe , B fe , C fc )} cannot guarantee that the limit point is a critical point 
(local minimum). 

We adapt the proposition in Section 7 in |17j to our problem. 

Theorem 4.3 Suppose that the sequence {(A fe , B fc , C k )} obtained from RALS 
has limit points, then every limit point (A, B, C) is a critical point of the Prob- 
lem (4-1). 

Proof. Recall the vectorization in Section [3] which allows us to re-express 
{(A fe , B fc , C fe )} as (xi,x 2 ,x 3 ) and the cost function as 

K J I R 

/(X1,X 2 ,X 3 ) = ^2^2^2{tijk - ^airbjrCkr) 2 
k—1 j—1 i—1 r—1 

where x x = vec(A) € R IR , x 2 = vec(B) e U JR and x 3 = vec(C) € R KR . Let 
{ x ™ fc }fcLi — {( x i" fc > x 2 nfc ) x 3 nfc )}fcLi be the converging subsequence of {(x! fe , x 2 fc , x 3 
which has the limit point (xi,x 2 ,x 3 ). 

The subproblem in the RALS method provides the following inequality: 

/( Xl " fc+1 , x 2 n * , x 3 "" ) < /( Xl " fc , x 2 "* , x 3 "« ) - X nk || Xl " fc+1 - Xl "-|| 2 - (4-7) 

Using the inequality above repeatedly, we have 

fix^ 1 ,*^ 1 ,*^ 1 ) < /(x a n »+ 1 ,X a B '+ 1 ) X s fl ') 

< /(x 1 n * +1 ,x 3 n »,x a n *) (4.8) 

< /(x 1 n »,x 3 "»,X3 n *). 

By the Squeeze Theorem, the continuity of / and (xi™ fc , x 2 " fc , x 3 " fc ) — > (xi, x 2 , x 3 ) 
as k — ¥ oo, then we have the following 

lim /(x 1 "*+ 1 ,x 2 "*,x 3 " fc )= lim /(x 1 nfc ,x 2 " t ,x 3 "'=) = /(x!,x 2 ,x 3 ). 

k—too k— yoo 
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Now taking the limits in (4.7| for k — > oo on both sides, we have 
lim ||xi" fc+1 -X!^" 2 

which implies 



k— f oo 







hm (x 1 n *+ 1 ) x a '* Xs w *) = (xx.xa.xg). 

k— >oo 



Similarly, we obtain 



hm (x 1 ^ +i ,x 2 "'=+ 1 ,X3"-) = (xx.xa.xg). 

k— >oo 



(4.9) 



(4.10) 



(4.11) 



Since every RALS subproblem is well defined, then each point in the subse- 
quence satisfies the corresponding optimality condition (Theorem 3.3); i.e. 

Vi/(x 1 »*+ 1 ,x a "*,x B n *) + 2A nfc ( Xl "* +1 - Xl ™*) = 0, (4.12) 
V 2 /(x 1 " fc+1 , x 2 " fc+1 , x 3 n *) + 2A„ fc (x 2 "* +1 - x 2 «* ) = 0, (4.13) 
V 3 /(xi" fc+1 ,x 2 " fc+1 ,x 3 n '= +1 ) + 2A nfc (x 3 " fc + 1 -X3"'-) =0. (4.14) 

Then, taking k -» oo in ( |4.12| - |4.14) , using the arguments in ( 49| , ( 4.10[ ), ( 4.11| ) 
and the continuity of V/, we obtain 



Vj/(xi,x 2 ,x 3 ) = 0, 



1,2,3. 



Thus, this proves that the limit point (xi,x 2 ,X3) is a critical point of the 
cost function /(x!,x 2 ,x 3 ). Furthermore, we obtain 



V A /(A,B,C) = 0, 
V B /(A,B,C)=0, 
V C /(A,B,C) = 0. 



(4.15) 



through the inverse mapping of the vectorization. Therefore, (A,B,C) is a 
critical point. □ 



Here are some remarks: 



1. Following from the discussion and the theorem above, we showed that 
RALS method solves the same ALS cost function. Moreover, we have 
proved that the limit point obtained from RALS is a critical point of the 
original minimization problem of \\T — T\\ 2 F . 

2. The main theorem above solves the CP decomposition on the whole space, 
so we use the optimality condition, V/(xi,x 2 ,X3) = 0. If the solution 
is not in the whole space, namely, in the problem of non-negative tensor 
decomposition, then the optimality condition,V/(xi,x 2 ,X3) T (y — Xi) > 0, 
must be used. 
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3. For the ALS method, under the same assumption in Theorem 4.3 the the- 
orem may not be true. From the assumption, we know that the sequence 
{(A nfc , B" fc , C" fc )} converges to a limit point (A,B, C), but we cannot 
obtain the sequences {(A n <*+\ B"*, C n ")} and {(A ,lfc+1 , B" fe+1 , C™*)} to 
converge. Furthermore, we also cannot prove that these two sequences 
converge to the same limit point (A, B, C). 



4. The optimality conditions in (4.15 ) are equivalent to the normal equations 
of the subproblems: 

T (1) /XJif (C0B) = A(C0B) T (C0B), 
T (2) JxM (C0 A) =B(C©A) T (C0 A), 
T (3) Kx "(B0A) = C(B A) T (B A). 

5. Thcorem |4.3| is a conditional convergence proof, impinging upon the exis- 
tence of the ALS limit points. Thus this result does not address the de- 
generacy problems. Analysis of the existence of the limits of the (R)ALS 
is a challenging problem that would require a careful study of the degen- 
erate cases of the CP decomposition. The regularization [4~5] considered by 
Paatero |30j is a good approach in finding approximation to the degenerate 
case, but the solutions satisfy the regularized cost functional and not the 
original least-squares functional. Moreover, a similar conditional conver- 



gent analysis [T7] can be established for the regularized functional 4.5 In 



fact, if A > 0, then the cost functional 4.5 will be component-wise strictly 



quasiconvex. Thus Theorem 3.7 applies and hence, the limit points of |4.6 



will be critical points of the regularized functional |4. 5 



5 Numerical Comparison of the ALS and RALS 
Algorithms 

In this section, we compare the performance of RALS against ALS. We give three 
examples of third-order tensor CP decomposition to demonstrate the swamp 
shortening property of the iterated regularization and one example of large real 
third-order tensor data. 

5.1 Example I: Initial Factors Dependent Swamp 

Let the matrices 





"1 


2" 




" 2 


1 " 




"3 


1 


A = 


2 


1 


, B = 


-1 


3 


, c = 


1 


2 




3 


2 




1 


-1 




2 


2 



be the three factor matrices of a third-order tensor T € R 3x3x3 of rank- two: 
T = sli o bi o ci + a 2 o b 2 o c 2 . 
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(a) right initial guess (b) random initial guess 

Figure 3 



2 F versus the number of 



In the two figures, the plots show the error \\T —T es t 
iterations it takes to obtain an error of 1 x 10 -5 , where T es t denotes the obtained 
tensor after every iteration. The red line denotes ALS method while the blue 
one is RALS in each picture for the same initial guess. 



3 a 



Two initial guesses are compared in Figures 3a 3b i n ter ms of ALS. In Figure 



A, B 



B 



and C° = C. For Figure 



ated 3x2 matrices as the initial factors. With {A,B 



3b 



we randomly gener- 



, C} as the initial 



guess, ALS takes 55 iterations to reach an error within 10 -5 while it takes 1988 
iterations by using random initial guess. Observe in Figure [3a] that ALS and 
RALS have the same convergence speed and take the same iterations to reach 
an error within 10~ J . However, in Figure 3b RALS can reduce the swamp by 
only taking 206 iterations in comparison to that of 1988 ALS iterations. In some 
cases, randomly generated factors can lead to swamp in the implementation of 
the ALS. However this swamp phenomena induced by the initial factors is not 
observed if the RALS method is used. 



5.2 Example II: Rank Specific Swamp 

Let the matrices 



A = 



1 2 3 

2 12 



B = 



-1 



1 1 

3 1 



C = 



3 1 
1 2 



be the three factor matrices of a third-order tensor T = a r o b r o c r e 



2x2x2 



that is a rank-three tensor. Rank-two (Figure 4b) and rank-three (Figure 4a I 
approximations are calculated with the following initial matrices: 



0.1679 0.7127 
0.9787 0.5005 



B u 



0.4711 0.6820 
0.0596 0.0424 



0.0714 0.0967 
0.5216 0.8181 



16 



So, the following picture shows the error plot by using ALS and RALS sepa- 
rately: 




(a) Rank-three (b) Rank-two 



Figure 4 



Notice that the rank-three tensor approximations present no problem in both 



ALS and RALS as seen in Figure 4a However, in Figure [Lb] the rank-two tensor 
approximation requires only 53 iterations RALS (blue line) to reach an error 
within 1CP 5 while ALS needs 27322 iterations as indicated in a swamp. Further 
investigation is needed to understand the degeneracy problems with respect to 
the RALS algorithm. 

5.3 Example III: Induced Rank-Deficiency Swamp 



From the example in Section |3.3| the RALS and ALS are compared. Recall 
that the rank deficiency of the Khatri-Rao products induce an ALS swamp. In 
the Figure |5a[ the error plots show a swamp for ALS with 9707 iterations while 
RALS exhibits no swamp with only 884 iterations. 




1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 1000 2000 3000 4000 5000 6000 7000 3000 9000 10000 

(a) Fifth order tensor (b) Singularity of Khatri-Rao 

Figure 5 

To understand why RALS is not hampered by a swamp, let's look at the 
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normal equation of the subproblem (we have already mentioned in the last 



section (4.3)) 



'(C^QB*)" 


X = 


r t T i 






A fc • (A fc ) T 



where the least squares solution is A k+1 . The submatrix • J_ RxR in (4.3 1 
embeds the range space of (C fc 0B fc ) in a higher dimensional space while induces 
a full rank linear least-squares subproblem. Thus, the regularization keeps the 
cost function strictly component-wise quasiconvex. 



5.4 Example IV: Large Real Datasets 

Since ALS type algorithms have been particularly useful in real large datasets, 
a comparison study of the ALS and RALS algorithms was made on a tensor 
X € jji70x274x35 f rom the paper of Bro et al. [35] in detecting and character- 
izing active photosensitizers in butter. The light exposure experimental data is 
obtained from different colors of light, variation in oxygen availability, and time 
of exposure while measuring the fluorescence EEMs (excitation emission matri- 
ces) and sensory evaluation of the samples. Thus the element Xijk represents 
the fluorescence intensity for sample i at excitation wavelength j and emission 
wavelength k. CP algorithms offer decomposition into factors of sample scores, 
emission loadings and excitation loadings. 

The ALS and RALS algorithms were used to analyze the fluorescence land- 
scapes with rank R = 7. In Figure [6j 100 different random initial starters for 
ALS and RALS were used on tensor X. The relative error (\\X k — X k ^ 1 \\ 2 F ) 
was used as the stopping criterion, but the absolute error (\\X — <^ nna i |||0 was 
measured as well. The table in Figure [6] shows that RALS performed slightly 
better than ALS with respect to both relative and absolute errors as well as the 
number of iterations. 



6 Conclusion 

The RALS method proposed by Navasca, Kindermann and De Lathauwer [35] is 
a numerical technique for the classical problem of solving the CP decomposition 
of a given tensor. We examined the RALS method to find some theoretical expla- 
nations on what we observed numerically. In many instances, several examples 
showed that RALS converges faster than ALS. Moreover, RALS decreases the 
high number of ALS iterations, thereby removing the swamp to some degree. 
Furthermore, our numerical experiments provide us a numerical justification 
that ALS swamping is related to the rank deficiency of the Khatri-Rao prod- 
ucts. This phenomena is not present when the RALS algorithm is implemented. 
Based on these observations, it is important to study the theoretical properties 
of RALS and its differences from ALS. Both the ALS and the RALS are related 
to the GS and the proximal modification of GS (PGS), respectively, by vector- 
izing the three factor matrices in the cost functionals. Using the properties of 
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ALS 


RALS 


rel. error 


1.2717 x KT 4 


1.3842 x lCr 4 


abs. error 


3850.8 


3832.9 


iterations 


3025 


2968 



The comparison of ALS and RALS 



3350 - 




7^ 




3845 - 
3840 - 








3835 - 


RALS 








3830 - 
3825 — 











2960 2970 2980 2990 3000 3010 3020 3030 3040 

Iterations 



Figure 6: A comparison of ALS and RALS for large data set [38], averaging 100 
random initialized runs. 



PGS, we have proved that the limit point of a converging sequence obtained 
from the RALS algorithm is a critical point of the original ALS problem. Some 
difficulties arise when proving the same convergent results for ALS due to the 
lack of strict quasiconvexity. These same difficulties are exhibited numerically 
as swamps. 
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